A Multi-task Joint Learning Model (MJLM) was proposed to address the performance bottleneck caused by treating viewpoint-invariant feature learning and view transformation as separate approaches in existing cross-view geo-localization methods. MJLM was composed of a proactive image generation model and a posterior image retrieval model. In the proactive generation model, Inverse Perspective Mapping (IPM) was first applied as a coordinate transformation to explicitly bridge the spatial domain gap, so that the projected image and the real satellite image shared approximately the same spatial geometry. Then the proposed Cross-View Generative Adversarial Network (CVGAN) implicitly matched and restored image content and texture at a fine-grained level, synthesizing smoother and more realistic satellite-style images. The posterior retrieval model consisted of a Multi-view and Multi-supervision Network (MMNet), which performed image retrieval with multi-scale features and multi-supervised learning. Experimental results on the Unmanned Aerial Vehicle (UAV) dataset University-1652 show that MJLM achieves an Average Precision (AP) of 89.22% and a Recall@1 (R@1) of 87.54%. Compared with LPN (Local Pattern Network) and MSBA (MultiScale Block Attention), MJLM improves R@1 by 15.29% and 1.07%, respectively. These results show that by handling cross-view image synthesis and retrieval jointly, MJLM fuses the view transformation and viewpoint-invariant feature approaches in a single framework, significantly improves the precision and robustness of cross-view geo-localization, and verifies the feasibility of UAV localization.
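To make the IPM coordinate transformation concrete, the following is a minimal sketch of ground-plane inverse perspective mapping, assuming a pinhole camera with known intrinsics K and pose (R, t); the function name and parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import cv2

def ipm_project(image, K, R, t, out_size=(512, 512), metres_per_px=0.1):
    """Warp an oblique UAV image onto the ground plane z = 0 (illustrative
    sketch; K, R, t and the ground-resolution parameter are assumptions).

    A ground point (X, Y, 0) maps into the image by
    s * [u, v, 1]^T = K [r1 r2 t] [X, Y, 1]^T, where r1, r2 are the first
    two columns of R, so H = K [r1 r2 t] is the ground-to-image homography.
    """
    H = K @ np.column_stack([R[:, 0], R[:, 1], t])   # ground (m) -> image (px)
    S = np.diag([metres_per_px, metres_per_px, 1.0])  # output px -> metres
    # warpPerspective expects the source-to-destination map, i.e. the
    # inverse of the (output-pixel -> image-pixel) composition H @ S.
    return cv2.warpPerspective(image, np.linalg.inv(H @ S), out_size)
```

After such a projection, the projected image and a true satellite image share roughly the same overhead geometry, which is the precondition for CVGAN's fine-grained refinement of content and texture.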
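The reported metrics can be computed as below; this is a standard sketch of Recall@1 and Average Precision for retrieval given a query-by-gallery similarity matrix, and the exact evaluation protocol of University-1652 may differ in detail.

```python
import numpy as np

def recall_at_1(sim, gt_index):
    """Fraction of queries whose top-ranked gallery item is the true match.
    sim: (num_queries, num_gallery) similarity matrix;
    gt_index: (num_queries,) index of each query's true match."""
    return float(np.mean(np.argmax(sim, axis=1) == gt_index))

def average_precision(sim_row, relevant):
    """AP for a single query; `relevant` is a boolean gallery mask."""
    order = np.argsort(-sim_row)              # rank gallery by similarity
    rel = relevant[order]
    precision = np.cumsum(rel) / (np.arange(rel.size) + 1)
    return float(np.sum(precision * rel) / max(rel.sum(), 1))
```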